model : add PLaMo-2 model #14560
Conversation
This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
This reduces overhead when running HellaSwag on thousands of sequences with very small 100k-parameter Mamba models.
This was otherwise a problem when running the HellaSwag benchmark with small batch sizes, causing it to crash.
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
  Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
  The implementation already supported it, and this makes Mamba's conv step slightly faster.
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway) Still, I'm open to better suggestions.
Co-authored-by: Sigbjørn Skjæret <[email protected]>
…mo2_session to follow the other tokenizer implementations
c805d75 to eea696e
Co-authored-by: Georgi Gerganov <[email protected]>
We can't expect users to do this; I think the better option would be to add this token as EOT at conversion.
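For reference, a sketch of how that could look in convert_hf_to_gguf.py, following the pattern other converters use; the `mark_eot` helper and the choice of `<|plamo:op|>` as the token are assumptions for illustration, not the actual change in this PR:

```python
import gguf
from transformers import AutoTokenizer

# Hypothetical helper sketching "mark a token as EOT at conversion time",
# mirroring how other model classes in convert_hf_to_gguf.py set special tokens.
# The "<|plamo:op|>" token choice is an assumption.
def mark_eot(dir_model, gguf_writer, eot_token="<|plamo:op|>"):
    tokenizer = AutoTokenizer.from_pretrained(dir_model, trust_remote_code=True)
    special_vocab = gguf.SpecialVocab(dir_model, load_merges=False)
    special_vocab._set_special_token("eot", tokenizer.get_vocab()[eot_token])
    special_vocab.add_to_gguf(gguf_writer)
```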
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Vocab needs to be padded or else loading embedded tokens will fail.
Co-authored-by: Sigbjørn Skjæret <[email protected]>
That's right. Thank you for the suggested changes.
The tokenizer is super slow, quite possibly something is wrong, please check it out, but test-tokenizer-0 passes.
Ah, I found that building the Aho tree is performed every time the tokenizer runs.
There's time to fix it now. :)
Thanks, I think it's fixed with 6921534
I will add vocab files to HF for CI in a week or so (so as not to break CI for everyone not in sync with master).
This PR adds support for the PLaMo-2 model in llama.cpp, which was also requested in a related discussion thread: #13874. The model uses a custom-implemented tokenizer, so this PR includes both the model itself (an architecture combining Mamba and Attention, similar to Jamba) and an implementation of the new custom tokenizer.
Based on #7531
How to check whether plamo-2-translate works with this PR: first, retrieve the model itself:
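For example (assuming the checkpoint is published as `pfnet/plamo-2-translate` on Hugging Face; the local directory name is also an assumption):

```sh
# Download the Hugging Face checkpoint into a local directory
huggingface-cli download pfnet/plamo-2-translate --local-dir ./plamo-2-translate
```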
Then, I needed to modify `tokenizer.jsonl` to pad it with some meaningless vocab entries so that the vocabulary size matches what is specified in `config.json`, namely 100032, by using this script:
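A minimal sketch of such a padding script, assuming each line of `tokenizer.jsonl` is one JSON record per token; the `<|padding_N|>` placeholder format and the file path are made up for illustration:

```python
import json

TARGET_VOCAB_SIZE = 100032  # vocab_size from config.json
PATH = "plamo-2-translate/tokenizer.jsonl"

with open(PATH, encoding="utf-8") as f:
    entries = [json.loads(line) for line in f if line.strip()]

# Append dummy entries until the vocabulary reaches the target size,
# reusing the layout of the last real entry (assumed to be a dict or list).
template = entries[-1]
for i in range(len(entries), TARGET_VOCAB_SIZE):
    dummy = dict(template) if isinstance(template, dict) else list(template)
    placeholder = f"<|padding_{i}|>"  # hypothetical placeholder token name
    if isinstance(dummy, dict):
        dummy["token"] = placeholder
    else:
        dummy[0] = placeholder
    entries.append(dummy)

with open(PATH, "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"padded vocabulary to {len(entries)} entries")
```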
Next, convert the model into GGUF with the following command:
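For example, using llama.cpp's `convert_hf_to_gguf.py` (the local model directory and output filename are assumptions):

```sh
python convert_hf_to_gguf.py ./plamo-2-translate --outfile plamo-2-translate.gguf
```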
Then build binaries as follows:
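For example, a standard CMake build whose output path matches the `./release/bin/llama-cli` binary used in the final command:

```sh
cmake -B release -DCMAKE_BUILD_TYPE=Release
cmake --build release --config Release -j
```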
and finally, I successfully ran the plamo-2-translate model as follows:
./release/bin/llama-cli -m plamo-2-translate.gguf -p "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=English\nHello, how are you?\n<|plamo:op|>output\n" -no-cnv --verbose-prompt --no-warmup -sp
intermediate outputs
Output:
Seems to be working correctly!